The authors define decision trees as a way to generate a prediction by carrying out a series of tests on the values of the descriptive features of a query instance, using the answers to those tests to determine the prediction.
A test on a descriptive feature is made at the root node; the outcome of this test determines which child of the root node is visited next. This process continues until a leaf node is reached and a final prediction is made. The preference in building decision tree models is a shallow tree: one that arrives at the expected level of accuracy with as few nodes as possible. To determine which descriptive features should be used at which nodes, we need a concrete understanding of Shannon's entropy model.
"Claude Shannon's entropy model defines a computational measure of the impurity of the elements in a set." - Page 123.
A basic way to think about entropy is the uncertainty of guessing the result of a random selection from a set. Sets which contain all of the same items have zero entropy, as you will always guess the correct result. Sets with more unique items have higher entropy, because the chance that you will guess the correct result is lower.
Outcomes with a large probability have a low entropy, and outcomes with a low probability have a large entropy. The relationship can be defined using a logarithmic function.
What is the difference between entropy and probability?
import numpy as np
import plotly.express as px

# probabilities strictly between 0 and 1 (log2 is undefined at 0)
p_space = np.linspace(0.01, 0.99, num=100)
y = np.log2(p_space)

fig = px.line(x=p_space, y=y)
fig.update_layout(xaxis_title='P(X)', yaxis_title='log2 P(X)')
fig.show()
# negating the log gives entropy: low-probability outcomes have high entropy
fig = px.line(x=p_space, y=y * -1)
fig.update_layout(xaxis_title='P(X)', yaxis_title='-log2 P(X)')
fig.show()
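The negative-log relationship plotted above can be wrapped in a small helper. This is my own sketch, not code from the book; the `entropy` function name is mine:

```python
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin -> 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin -> about 0.469 bits
print(entropy([1.0]))        # certain outcome: zero entropy
```

Note how the certain outcome has zero entropy while the even split has the maximum entropy for two outcomes, matching the intuition above.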
An example from the book, illustrated in Python, is to calculate the entropy of different scenarios using a deck of cards. The first scenario is to calculate the entropy of selecting a single random card. The lower the probability, the higher the entropy. For instance, below we calculate that the probability of selecting a single specific card is about 0.019. The entropy of this scenario is the negative log base 2 of this probability, which is about 5.7.
card_values = ["two","three","four","five","six","seven","eight","nine","ten","jack","queen","king","ace"]
card_suites = ["spades","clubs","hearts","diamonds"]

# build the 52-card deck as (value, suit) tuples
deck = []
for c in card_values:
    for s in card_suites:
        deck.append((c, s))

print("The deck has {} cards.".format(len(deck)))

p_card = 1 / len(deck)
print("The probability of pulling a unique card is {}.".format(p_card))
print("The entropy of this set is {}".format(-np.log2(p_card)))
The deck has 52 cards.
The probability of pulling a unique card is 0.019230769230769232.
The entropy of this set is 5.700439718141092
p_suite = 1 / len(card_suites)
entropy_suite = -np.log2(p_suite)
print("The probability of pulling a given suit is {}.".format(p_suite))
print("The entropy if we are considering pulling a given suit is {}.".format(entropy_suite))
The probability of pulling a given suit is 0.25.
The entropy if we are considering pulling a given suit is 2.0.
Information gain is defined as a measure of the reduction in the overall entropy of a set of instances that is achieved by testing on a descriptive feature. The book defines a three-step process for computing information gain:

1. Compute the entropy of the original dataset with respect to the target feature.
2. For each descriptive feature, compute the entropy that remains after partitioning the dataset by that feature (the weighted sum of the partition entropies).
3. Subtract the remaining entropy from the original entropy to get the information gain.
Pages 131 and 132 break down the formula for iterating through feature sets to get information gain, and are good references to point back to. When computing information gain, the higher the value in bits, the more discriminatory the feature, which is good for classification problems.
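The three-step process can be sketched compactly in plain Python. This is my own condensed version, not the book's code; the `entropy` and `info_gain` function names are mine:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of target labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Original entropy minus the weighted entropy remaining after
    partitioning the instances by the feature's values."""
    n = len(labels)
    rem = 0.0
    for v in set(feature_values):
        part = [l for f, l in zip(feature_values, labels) if f == v]
        rem += (len(part) / n) * entropy(part)
    return entropy(labels) - rem

# the Suspicious Words feature from the chapter's spam dataset
words  = [True, True, True, False, False, False]
labels = ["SPAM", "SPAM", "SPAM", "HAM", "HAM", "HAM"]
print(info_gain(words, labels))  # 1.0 -- a perfect split
```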
"A key part of doing this is being able to decide which tests should be included in the sequence and in what order. The information gain model we have developed allows us to decide which test we should add to the sequence next because it enables us to select the best feature to use on a given dataset. In the next section, we introduce the standard algorithm for growing decision trees in this way." - Page 132
Below I've taken the data set provided in the chapter and created a pandas dataframe. Using this data, I'll walk through information gain in Python.
import pandas as pd
df = pd.read_csv('chapter_four_spam.csv')
df
|   | ID  | Suspicious Words | Unknown Sender | Contains Images | Class |
|---|-----|------------------|----------------|-----------------|-------|
| 0 | 376 | True             | False          | True            | SPAM  |
| 1 | 489 | True             | True            | False           | SPAM  |
| 2 | 541 | True             | True            | False           | SPAM  |
| 3 | 693 | False            | True            | True            | HAM   |
| 4 | 782 | False            | False           | False           | HAM   |
| 5 | 976 | False            | False           | False           | HAM   |
First we will want to understand the entropy of the entire set. We have two target classes, SPAM and HAM, so the entropy of the data set with respect to these two target classes is computed as follows:
spam_entropy = (3/6 * np.log2(3/6))
spam_entropy
-0.5
ham_entropy = (3/6 * np.log2(3/6))
ham_entropy
-0.5
-(ham_entropy + spam_entropy)
1.0
A more succinct way of writing this for any target set in a dataframe would be:
target_classes = ['SPAM','HAM']
df = df[df['Class'].isin(target_classes)]

entropy = []
for t in target_classes:
    # proportion of instances with this target class
    class_prob = df[df['Class']==t].shape[0] / df.shape[0]
    class_entropy = class_prob * np.log2(class_prob)
    entropy.append(class_entropy)

total_entropy = -sum(entropy)
print(entropy, total_entropy)
[-0.5, -0.5] 1.0
Now that we have the entropy of the entire data set given the two target classes, we can compute the entropy that remains after we split the data by each descriptive feature.
Say we split the data into a subset looking only at Suspicious Words: we have three occurrences where Suspicious Words is true, and all three of those have a target class of SPAM, none of HAM.
def compute_entropy_remain(feature, classes, df):
    # split the data by the outcome of the test on the feature
    true_df = df[df[feature]==True]
    false_df = df[df[feature]==False]
    dataset_size = df.shape[0]
    # weight of each partition: its share of the full data set
    true_weight = true_df.shape[0] / dataset_size
    false_weight = false_df.shape[0] / dataset_size
    # entropy of each partition with respect to the target classes
    true_entropies = []
    false_entropies = []
    for c in classes:
        for subset_df, entropies in ((true_df, true_entropies), (false_df, false_entropies)):
            class_prob = subset_df[subset_df['Class']==c].shape[0] / subset_df.shape[0]
            if class_prob != 0:
                entropies.append(-class_prob * np.log2(class_prob))
            else:
                entropies.append(0)
    # remaining entropy is the weighted sum of the partition entropies
    rem = true_weight * sum(true_entropies) + false_weight * sum(false_entropies)
    print(rem)
    return rem
word_entropy = compute_entropy_remain('Suspicious Words', ['SPAM','HAM'],df)
print("Information Gain is ",total_entropy - word_entropy)
0.0
Information Gain is 1.0
sender_entropy = compute_entropy_remain('Unknown Sender', ['SPAM','HAM'],df)
print("Information Gain is ",total_entropy - sender_entropy)
0.9182958340544896
Information Gain is 0.08170416594551044
image_entropy = compute_entropy_remain('Contains Images',['SPAM','HAM'],df)
print("Information Gain is ",total_entropy - image_entropy)
1.0
Information Gain is 0.0
Based on the above information, we have a descriptive feature that can completely determine the target class of SPAM or HAM. This typically wouldn't be the case, but it showcases how you would use information gain to split data sets up by features to determine a target class.
ID3 is an approach for building the shallowest decision tree possible using information gain. It first computes the information gain of each descriptive feature in the training set and selects the best feature to use first: the one with the highest information gain. A root node testing that feature is added to the tree, and the data set is partitioned by the outcomes of that test. The process is then repeated on each partition, with the selected feature removed from consideration.
The process continues until all the instances in a partition have the same target level, at which point a leaf node is created and labeled with that level.
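The recursive process above can be sketched over the chapter's spam dataset using plain dicts and lists instead of pandas. This is my own minimal sketch of ID3, not the book's code; the `entropy`, `info_gain`, and `id3` names are mine:

```python
from math import log2
from collections import Counter

def entropy(rows, target):
    n = len(rows)
    return -sum((c / n) * log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def info_gain(rows, feature, target):
    n = len(rows)
    rem = 0.0
    for v in set(r[feature] for r in rows):
        part = [r for r in rows if r[feature] == v]
        rem += (len(part) / n) * entropy(part, target)
    return entropy(rows, target) - rem

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure partition -> leaf node
        return labels[0]
    if not features:                   # no features left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # pick the feature with the highest information gain for this partition
    best = max(features, key=lambda f: info_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    return {best: {v: id3([r for r in rows if r[best] == v], remaining, target)
                   for v in set(r[best] for r in rows)}}

data = [
    {"Suspicious Words": True,  "Unknown Sender": False, "Contains Images": True,  "Class": "SPAM"},
    {"Suspicious Words": True,  "Unknown Sender": True,  "Contains Images": False, "Class": "SPAM"},
    {"Suspicious Words": True,  "Unknown Sender": True,  "Contains Images": False, "Class": "SPAM"},
    {"Suspicious Words": False, "Unknown Sender": True,  "Contains Images": True,  "Class": "HAM"},
    {"Suspicious Words": False, "Unknown Sender": False, "Contains Images": False, "Class": "HAM"},
    {"Suspicious Words": False, "Unknown Sender": False, "Contains Images": False, "Class": "HAM"},
]
tree = id3(data, ["Suspicious Words", "Unknown Sender", "Contains Images"], "Class")
print(tree)  # the tree splits once, on Suspicious Words
```

Because Suspicious Words has an information gain of 1.0 bit, the tree needs only a single test: a root node on Suspicious Words with one SPAM leaf and one HAM leaf.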
Instead of using entropy as the measure of impurity that determines which partitions happen in which order, we can use what is called the information gain ratio, which takes the information gain and divides it by the amount of information used to determine the value of the feature (the entropy of the split itself). This removes the bias information gain has toward features with many values, which split the data into many small subsets.
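The denominator of the gain ratio can be sketched as follows; the `split_info` name and the sample values are my own, and the gain of 0.5 bits is an assumed figure for illustration:

```python
from math import log2
from collections import Counter

def split_info(values):
    """Entropy of the split itself: how much information the feature's
    own value distribution uses, regardless of the target."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# hypothetical feature that splits six instances into three equal partitions
feature_values = ["a", "a", "b", "b", "c", "c"]
gain = 0.5  # assumed information gain for this feature
ratio = gain / split_info(feature_values)
print(round(split_info(feature_values), 3))  # 1.585 (log2 of 3)
print(round(ratio, 3))                       # 0.315
```

A feature with many distinct values has a large split info, so its gain ratio is penalized relative to a feature achieving the same gain with fewer partitions.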
"The GINI index can be understood as calculating how often the target levels of instances in a dataset would be misclassified if predictions were made on the sole basis of the distribution of the target levels in the dataset." - Page 145.
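That description of the Gini index translates directly into code. A minimal sketch (the `gini` function name is mine):

```python
from collections import Counter

def gini(labels):
    """Chance of misclassifying an instance if predictions are made
    purely from the dataset's own class distribution."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["SPAM"] * 3 + ["HAM"] * 3))  # 0.5 -- maximally impure two-class set
print(gini(["SPAM"] * 6))                # 0.0 -- pure set
```

Like entropy, the Gini index is zero for a pure set and largest for an even class split, so it can stand in for entropy when computing which feature to partition on.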
Decision trees can overfit the data as they get deeper, because the resulting predictions are based on increasingly fine partitions. Partitions can occur because of noise in the data or sampling variance.
Tree pruning attempts to remove subtrees within a decision tree that are due to noise or sampling variance. This helps us manage the problem of overfitting, which causes poor generalization.
Pre-pruning creates an early stopping criterion that prevents the tree from creating subtrees past a certain threshold, for instance when information gain falls below a certain number, or by using more advanced statistical tests such as chi-squared.
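A pre-pruning check like the one described could be sketched as a guard inside the tree-growing loop. The thresholds and the `should_stop` name are hypothetical:

```python
# hypothetical early-stopping thresholds
MIN_GAIN = 0.1
MAX_DEPTH = 3

def should_stop(best_gain, depth):
    """Pre-pruning check: stop partitioning when the best available split
    is too weak or the tree has grown too deep."""
    return best_gain < MIN_GAIN or depth >= MAX_DEPTH

print(should_stop(0.05, 1))  # True  -> make a leaf here
print(should_stop(0.90, 1))  # False -> keep splitting
```

A tree-growing algorithm would call such a check before each split and emit a majority-class leaf whenever it returns true.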
"By stopping the partitioning of the data early, however, induction algorithms that use pre-pruning can fail to create the most effective trees because they miss interactions between features that emerge within subtrees that are not obvious when the parent nodes are being considered. Pre-pruning can mean that these useful subtrees are never created." - Page 155.
Post-pruning - Unlike pre-pruning, the tree is allowed to grow to completion, then branches that could cause overfitting are subsequently removed. The authors recommend the use of criteria that compare the error rate in the predictions made by a decision tree when a given subtree is included and when it is pruned.
Reduced error pruning - Trees are allowed to grow to completion, then error rates are reviewed in a bottom-up manner. "The error rate resulting from predictions for the instances in the validation dataset made at the root node for each subtree is compared to the error rate resulting from predictions made at the leaves of the subtree. If the error rate at the subtree node is less than or equal to the combined error rate at the leaves, the subtree is pruned."
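The comparison in that quote boils down to a single rule applied to each subtree. A minimal sketch, with hypothetical validation error counts and a made-up `should_prune` name:

```python
def should_prune(root_errors, leaf_errors):
    """Reduced error pruning rule: prune the subtree when predicting at its
    root (with the majority class) does no worse on the validation set
    than its leaves combined."""
    return root_errors <= leaf_errors

# hypothetical validation error counts for one subtree
print(should_prune(2, 3))  # True  -> replace the subtree with a leaf
print(should_prune(4, 3))  # False -> keep the subtree
```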
A model ensemble is a model that is composed of a set of models. The defining characteristics of an ensemble model are:

- It builds multiple different models from the same dataset, inducing each model so that the models make different kinds of errors.
- It makes predictions by aggregating the predictions of its constituent models, for example by majority vote or by taking the median.
Bagging, or bootstrap aggregating, trains each model on a random sample drawn with replacement, which is called a bootstrap sample; one model is induced from each bootstrap sample. This has the effect of creating different training sets. Decision trees are very sensitive to the data in the training phase, as a different data set can cause the creation of a different root node, which has significant ripple effects down the tree.
Models which combine bagging, subspace sampling, and decision trees are known as random forests.
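The two sources of randomness that go into a random forest can be sketched in plain Python. The function names are mine, and the data is made up for illustration:

```python
import random

def bootstrap_sample(data, rng):
    """Bagging ingredient: sample len(data) instances with replacement."""
    return [rng.choice(data) for _ in data]

def subspace_sample(features, k, rng):
    """Subspace sampling ingredient: each tree considers only a random
    subset of k descriptive features."""
    return rng.sample(features, k)

rng = random.Random(42)  # seeded for repeatability
data = list(range(10))
features = ["Suspicious Words", "Unknown Sender", "Contains Images"]

# each tree in the forest would be trained on its own pair of samples like these
print(sorted(bootstrap_sample(data, rng)))  # duplicates appear, some items go missing
print(subspace_sample(features, 2, rng))
```

Growing one decision tree per (bootstrap sample, feature subset) pair and aggregating their predictions by majority vote gives the random forest.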
Boosting is done by focusing on misclassified pieces of data. Gradient boosting iteratively trains prediction models, making later models specialize in the areas where earlier models struggled. It focuses on the errors made at each stage: each new model is trained to predict and reduce the prediction errors of the ensemble built so far. Typically you end up with a large number of shallow trees.
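The residual-fitting idea behind gradient boosting can be sketched for regression with hand-rolled stumps as the shallow base models. This is a toy illustration of the technique, not an implementation from the book; the data and the `stump`/`gradient_boost` names are made up:

```python
def stump(x, y):
    """Fit a depth-1 regression 'tree': one threshold, two leaf means."""
    best = None
    for t in x:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((yi - (lm if xi <= t else rm)) ** 2 for xi, yi in zip(x, y))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, rounds=5):
    """Each round fits a stump to the current residuals (the errors of the
    ensemble so far) and adds it to the ensemble."""
    models = []
    for _ in range(rounds):
        residuals = [yi - sum(m(xi) for m in models) for xi, yi in zip(x, y)]
        models.append(stump(x, residuals))
    return lambda xi: sum(m(xi) for m in models)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
f = gradient_boost(x, y)
print([round(f(xi), 2) for xi in x])  # predictions approach y round by round
```

Each stump alone is a weak model, but because every new stump is fit to what the current ensemble still gets wrong, the summed predictions steadily approach the targets.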